
Creates Structure for Upload Requests #61

Merged: 3 commits merged into main on Nov 17, 2023
Conversation

@gbdubs (Contributor) commented Nov 16, 2023

The request is two-step: (a) create the upload URLs and blobs; then, after the assets are uploaded, (b) explicitly kick off a parsing process.

Sending the PR for the interface first - trying to get smaller ones going.

@gbdubs gbdubs requested a review from bcspragu November 16, 2023 17:04
@bcspragu (Collaborator) left a comment


LGTM, though I'm not sure I understand the API:

  1. Users request to upload N items
  2. Users get N upload URLs
  3. Users upload N files
  4. Users say "Hey, I uploaded" - One weakness here is that clients don't need to report the same number N of files here
  5. Users get back some ID? And the task starts running?

A more "bulletproof" API would be something like:

  1. Create a portfolio (or portfolio group, or process group whatever primitive files get attached to), get an ID back
  2. Request to upload files to that portfolio ID
  3. Can do step 2 as many times as is useful, can be batch or not batch, doesn't matter
  4. Mark files completed after they're uploaded. We could do this behind the scenes (as noted elsewhere, using blob storage events), but we could also just roll this validation step into step 5 below.
  5. Request to start the processing

From the web client's perspective, this whole process could be kicked off in one go once they select their files, it just makes the API slightly harder to misuse, intentionally or accidentally.

CompletePortfolioUpload:
  type: object
  required:
    - incomplete_upload_ids
@bcspragu (Collaborator):

nit: Confusing name, since they are completed from the user's perspective, they're done uploading. I'd just call them upload_ids, it's unambiguous here.

@gbdubs (Contributor, Author):
I'd prefer to keep this as-is - I think drift in terminology should be avoided when it's all internal-only.

@gbdubs (Contributor, Author) commented Nov 17, 2023

> LGTM, though I'm not sure I understand the API:
>
>   1. Users request to upload N items
>   2. Users get N upload URLs
>   3. Users upload N files
>   4. Users say "Hey, I uploaded" - One weakness here is that clients don't need to report the same number N of files here
>   5. Users get back some ID? And the task starts running?
>
> A more "bulletproof" API would be something like:
>
>   1. Create a portfolio (or portfolio group, or process group, whatever primitive files get attached to), get an ID back
>   2. Request to upload files to that portfolio ID
>   3. Can do step 2 as many times as is useful, can be batch or not batch, doesn't matter
>   4. Mark files completed after they're uploaded. We could do this behind the scenes (as noted elsewhere, using blob storage events), but we could also just roll this validation step into step 5 below.
>   5. Request to start the processing
>
> From the web client's perspective, this whole process could be kicked off in one go once they select their files, it just makes the API slightly harder to misuse, intentionally or accidentally.

OK FWIW, I think this is mostly doing what you think it is, just with a few translations.

  • An Incomplete Upload is a file that has been uploaded but hasn't parsed successfully - either because parsing failed, or because it is currently processing, or because it hasn't yet been processed. This staging-ground approach lets every portfolio in the system correspond to a unit that is fit for any type of processing. The requirements specify things like deleting incomplete uploads by default, and building that kind of logic over portfolios would have been riskier. When an incomplete upload successfully parses into one or more portfolios, we delete the incomplete upload.
  • The "marking as completed" is just the request to start processing - it's just step 5; there is no step 4.
  • The reason I think it is stronger to request the set of incomplete uploads to process is that if an upload fails (e.g. the file is too big and is rejected by cloud storage), the client can be smarter about how it handles partial-success cases on the frontend, with no extra storage required on the backend (i.e. if we didn't specify which uploads to run when we say "start the run", we would have to record more concretely which blobs were uploaded and which were associated with a not-yet-started task).

Does that make sense?

@gbdubs gbdubs requested a review from bcspragu November 17, 2023 17:12
@gbdubs gbdubs merged commit b714121 into main Nov 17, 2023
2 checks passed